Abstract: One of the difficult tasks in Natural Language
Processing (NLP) is to resolve the sense ambiguity of
characters or words in text, such as polyphones, homonyms,
and homographs. This paper addresses the ambiguity of
Chinese polyphonic characters and approaches for
disambiguating them. Three methods, dictionary matching,
language models, and a voting scheme, are used to
disambiguate the prediction of polyphones. Compared with the
well-known MS Word 2007 and language models (LMs), our
approach is superior to both for this issue. The final precision
rate is enhanced up to 92.75%. Based on the proposed
approaches, we have constructed an e-learning system in
which several related functions of Chinese transliteration are
integrated.

Keywords: Natural Language Processing, Sense Disambiguation,
Language Model, Voting Scheme

I. Introduction 

In recent years, natural language processing (NLP) has
been studied and applied in many fields, such as machine
translation, speech processing, lexical analysis, information
retrieval, spelling prediction, and handwriting recognition
[1][2]. Among the computational models, syntactic parsing,
word segmentation, and the generation of statistical
language models have been the focus.

In general, every natural language exhibits ambiguity among
characters or words in text, such as polyphones, homonyms,
homographs, and combinations of them. Resolving this
ambiguity is necessary for most natural language processing
applications. One of the difficult tasks in NLP is to resolve
the sense ambiguity of words, which is called word sense
disambiguation (WSD) [3, 4].

Resolving sense ambiguity can alleviate many problems in
NLP. This paper applies dictionary matching, statistical
N-gram language models (LMs), and a voting scheme, which
includes two scoring methods, preference and
winner-take-all, to retrieve Chinese lexical knowledge for
WSD on Chinese polyphonic characters. There are nearly 5700
frequently used unique characters, and more than 1300 of
them have at least two different pronunciations; these are
called polyphonic characters. Predicting the correct category
of a polyphonic character can be regarded as a WSD problem.

The paper is organized as follows. Related works on WSD
are presented in Section 2. The three methods are described
in Section 3, and the experimental results are shown and
analyzed in Section 4. Conclusions and future works are
given in the last section.

II. Related Works

Automatically resolving word sense ambiguity can enhance
language understanding, which is useful in several fields, such
as information retrieval, document categorization, grammar
analysis, speech processing, and text preprocessing. In past
decades, ambiguity issues have been considered AI-complete,
that is, problems that can be solved only by first resolving
all the difficult problems in artificial intelligence (AI), such
as the representation of common sense and encyclopedic
knowledge. Sense disambiguation is required for correct
phonetization of words in speech synthesis [13], and also for
word segmentation and homophone discrimination in speech
recognition.

It is essential for language understanding applications
such as message understanding and man-machine
communication. WSD can be applied in many fields of
natural language processing [10], such as machine
translation, information retrieval (IR), speech processing,
and text processing.

The approaches on WSD are categorized as follows: 

A. Machine-Readable Dictionaries (MRD):

These rely on the word information in a dictionary to
resolve sense ambiguity, such as WordNet or the Academia
Sinica Chinese Electronic Dictionary (ASCED) [17].

B. Computational Lexicons:

These employ the lexical information in a thesaurus, such as
the well-known WordNet [11, 14], which contains the
lexical clues of characters and the lattice among related
characters.

C. Corpus-based Methods:

These depend on statistics from a corpus, such as term
occurrences, part-of-speech (POS), and the location of
characters and words [12, 15].

D. Neural Networks:

This approach is based on the concept codes of a thesaurus
or the features of lexical words [16, 17].

Many works have addressed WSD, and several methods have
been proposed so far. Because of a feature unique to the
Chinese language, namely Chinese word segmentation, more
than two different features are employed to achieve higher
prediction accuracy on WSD. Therefore, two further methods
are introduced in this paper.
III. Description of Proposed Methods

In this paper, three methods are proposed to disambiguate
the sense category of Chinese polyphones: dictionary
matching, N-gram language models, and a voting scheme.
Each is explained in detail below.

A. Dictionary Matching

To predict the sense category of a polyphone correctly,
dictionary matching is applied first. Within a Chinese
sentence, the location p of the polyphonic character $w_p$ is
taken as the centre, and the left and right substrings around
p are extracted. The two substrings are denoted $CH_L$ and
$CH_R$, as shown in Fig. 1. Within a window, all possible
substrings of $CH_L$ and $CH_R$ are segmented and then
matched against the lexicon in the dictionary.

[Figure: the sentence $w_1, w_2, \ldots, w_p, w_{p+1}, w_{p+2}, \ldots, w_n$,
with the left substring $CH_L$ and the right substring $CH_R$
marked around $w_p$.]

Fig. 1: A sentence with target polyphonic character $w_p$ is
divided into two substrings.

If words are found in both substrings, the pronunciation of
the polyphone is decided by the priority of the longest word
and the highest word frequency: word length first, then word
frequency. In this paper, the window size is 6 Chinese
characters, that is,

$$LEN(CH_L) = LEN(CH_R) = 6$$

The Chinese dictionary is available and contains nearly
130K Chinese words. Each Chinese word is composed of 2 to
12 Chinese characters. Every word in the dictionary carries
its frequency, part-of-speech (POS), and transliteration¹,
from which the correct pronunciation of a polyphonic
character in the word can be decided.

The algorithm of dictionary matching is described as
follows:

step 1. Read in the sentence and find the location p of the
        polyphone target $w_p$.
step 2. Based on the location of $w_p$, segment and extract all
        possible substrings of $CH_L$ and $CH_R$ within the
        window (size = 6), then compare them with the lexicon
        in the Chinese dictionary.
step 3. If any Chinese word is found in both substrings,
        go to step 4; otherwise go to step 5.
step 4. Decide the sense category of the pronunciation of the
        polyphone based on the priority scheme of longest
        word and highest word frequency. Then the process
        ends.
step 5. The pronunciation of polyphone $w_p$ is predicted by
        the methods in the following phase.

¹ Zhuyin Fuhau (注音符號) can be found in the dictionary.
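
The matching steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the toy lexicon, its counts, the pinyin labels such as "zhong1", and the function names are hypothetical. The sketch also assumes that $CH_L$ ends at $w_p$, that $CH_R$ starts at $w_p$, that only words containing $w_p$ are considered, and that a match on either side is enough to apply the priority scheme.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical toy lexicon: word -> (frequency, pronunciation of 中 in that word).
# Words, counts, and pinyin labels are illustrative, not taken from ASCED.
LEXICON: Dict[str, Tuple[int, str]] = {
    "中國": (3000, "zhong1"),
    "中國人": (500, "zhong1"),
    "中毒": (100, "zhong4"),
}

WINDOW = 6  # LEN(CH_L) = LEN(CH_R) = 6


def candidate_words(sentence: str, p: int, window: int = WINDOW) -> List[str]:
    """Substrings of CH_L and CH_R that contain the target character at index p."""
    ch_l = sentence[max(0, p - window + 1):p + 1]  # left substring, ends with w_p
    ch_r = sentence[p:p + window]                  # right substring, starts with w_p
    # Suffixes of CH_L and prefixes of CH_R, each at least two characters long.
    cands = [ch_l[i:] for i in range(len(ch_l) - 1)]
    cands += [ch_r[:j] for j in range(2, len(ch_r) + 1)]
    return cands


def match_pronunciation(sentence: str, p: int,
                        lexicon: Dict[str, Tuple[int, str]] = LEXICON) -> Optional[str]:
    """Return the pronunciation chosen by dictionary matching, or None (step 5)."""
    matches = [(len(w), lexicon[w][0], lexicon[w][1])
               for w in candidate_words(sentence, p) if w in lexicon]
    if not matches:
        return None  # no word found: defer to the language model or voting scheme
    # Step 4: longest word first, then highest frequency.
    best = max(matches, key=lambda m: (m[0], m[1]))
    return best[2]


sentence = "我們回顧中國人的歷史"  # illustrative sentence, not example (E2)
print(match_pronunciation(sentence, sentence.index("中")))
```

Here both 中國 and 中國人 match in the right substring, and the longer word 中國人 decides the pronunciation even though its frequency is lower. Returning `None` corresponds to step 5, where the decision is left to the later methods.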

B. Language Models - LMs 

In recent years, statistical language models have been
widely adopted in NLP. Suppose that $W = w_1, w_2, w_3, \ldots, w_n$,
where $w_i$ and $n$ denote the $i$-th Chinese character and the
number of characters in the sentence ($0 < i \le n$). Using the
chain rule,

$$P(W) = P(w_1, w_2, \ldots, w_n)$$

$$P(w_1^n) = P(w_1)\,P(w_2 \mid w_1)\,P(w_3 \mid w_1^2) \cdots P(w_n \mid w_1^{n-1}) = \prod_{k=1}^{n} P(w_k \mid w_1^{k-1}) \tag{1}$$

where $w_1^{k-1}$ denotes the string $w_1, w_2, w_3, \ldots, w_{k-1}$.

In Eq. (1), the probability $P(w_k \mid w_1^{k-1})$ is
calculated, starting at $w_1$, by using the substring
$w_1, w_2, w_3, \ldots, w_{k-1}$ to predict the occurrence
probability of $w_k$. For longer strings, a large corpus is
needed to train a language model with good performance,
which costs much labor and time.

In general, unigram, bigram, and trigram models ($N \le 3$)
[5][6] are generated. An N-gram model calculates the
probability $P(\cdot)$ of the $k$-th event from only the
preceding $N-1$ events, rather than from the whole preceding
string $w_1, w_2, w_3, \ldots, w_{k-1}$.

In short, an N-gram is an $(N-1)$th-order Markov model,
which calculates the conditional probability of successive
events: the probability of the $N$-th event given that the
preceding $N-1$ events have occurred. Basically, the N-gram
language model is expressed as follows:

$$P(W) \approx \prod_{k=1}^{n} P(w_k \mid w_{k-N+1}^{k-1}) \tag{2}$$

N=1, unigram or zero-order Markov model.
N=2, bigram or first-order Markov model.
N=3, trigram or second-order Markov model.

In Eq. (2), the relative frequency is used to calculate
$P(\cdot)$:

$$P(w_n \mid w_{n-N+1}^{n-1}) = \frac{C(w_{n-N+1}^{n})}{C(w_{n-N+1}^{n-1})} \tag{3}$$

where $C(w)$ denotes the count of event $w$ in the
training corpus.

The probability $P(\cdot)$ obtained from Eq. (3) is called the
Maximum Likelihood Estimate (MLE). To predict the
pronunciation category of a polyphone, we compute the
probability for each category $t$ ($1 \le t \le T$), where $T$
denotes the number of categories of the polyphone. The
category with the maximum probability $P_{max}(W)$ with
respect to the sentence $W$ is the target, and the correct
pronunciation of the polyphone is decided accordingly.
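
As a rough illustration of Eqs. (2) and (3) with N=2, the sketch below trains one character bigram model per pronunciation category and picks the category whose model gives the sentence the highest probability. How the category-specific models are trained is an assumption here (one model per category, from sentences labeled with that category), as are the sentence-start symbol `<s>`, the toy training phrases, and the pinyin labels.

```python
from collections import Counter
from typing import Dict, List, Tuple

Model = Tuple[Counter, Counter]  # (context counts, bigram counts)


def train_bigram(sentences: List[str]) -> Model:
    """Count contexts C(w_{k-1}) and bigrams C(w_{k-1} w_k) over characters."""
    contexts, bigrams = Counter(), Counter()
    for s in sentences:
        chars = ["<s>"] + list(s)      # "<s>" marks the sentence start (assumption)
        contexts.update(chars[:-1])
        bigrams.update(zip(chars[:-1], chars[1:]))
    return contexts, bigrams


def sentence_prob(sentence: str, model: Model) -> float:
    """Eq. (2) with N=2; each factor is the relative frequency of Eq. (3)."""
    contexts, bigrams = model
    chars = ["<s>"] + list(sentence)
    prob = 1.0
    for prev, cur in zip(chars[:-1], chars[1:]):
        if contexts[prev] == 0:
            return 0.0                 # unseen context: the MLE is zero
        prob *= bigrams[(prev, cur)] / contexts[prev]
    return prob


def predict(sentence: str, models: Dict[str, Model]) -> str:
    """Return the category t whose model maximizes P_t(W)."""
    return max(models, key=lambda t: sentence_prob(sentence, models[t]))


# Toy training data, purely illustrative.
models = {
    "zhong1": train_bigram(["中心", "中國"]),
    "zhong4": train_bigram(["中毒", "中獎"]),
}
print(predict("中毒", models), sentence_prob("中毒", models["zhong4"]))
```

Without smoothing, any unseen bigram drives the whole product to zero, which is the zero-count issue discussed in Section III.D. A practical implementation would also sum log probabilities instead of multiplying, to avoid underflow on long sentences.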

C. Voting Scheme 

In contrast to the N-gram models above, we propose a
voting scheme based on a concept similar to elections in
human society. Basically, each voter votes for one candidate,
and the candidate with the most votes is the winner. In the
real world, more than one candidate may win an election,
whereas in the disambiguation process only one category of
the polyphone is the final target for the pronunciation.

The voting scheme can be described as follows: each
token in the sentence acts as a voter and votes for its
favorite candidate based on the probability calculated from
the lexical features of the token. The total score $S_t(W)$
accumulated from all voters is obtained for each category
$t$, and the candidate category with the highest score is the
final winner. In this paper, there are two voting methods:

1) Winner-Take-All:

In this voting method, the probability is calculated as
follows:

$$P_t(w_i) = \frac{C(w_i, t)}{C(w_i)} \tag{4}$$

where $C(w_i)$ denotes the occurrences of $w_i$ in the training
corpus, and $C(w_i, t)$ denotes the occurrences of $w_i$ for sense
category $t$ in the training corpus.

In Eq. (4), $P_t(w_i)$ is regarded as the probability of
$w_i$ in category $t$. In winner-take-all scoring, the category
with the maximum probability wins the ticket: it receives
one ticket (score 1), while all other categories receive no
ticket (score 0). Each voter therefore has exactly one ticket.
The winner-take-all score for token $w_i$ is defined as
follows:

$$P_t(w_i) = \begin{cases} 1 & \text{if } P_t(w_i) \text{ is the maximum among all } T \text{ categories} \\ 0 & \text{for all other categories} \end{cases} \tag{5}$$

According to Eq. (5), the total score of each category
is accumulated over all tokens in the sentence:

$$S_t(W) = P_t(w_1) + P_t(w_2) + P_t(w_3) + \cdots + P_t(w_n) = \sum_{i=1}^{n} P_t(w_i) \tag{6}$$

2) Preference Scoring:

The other voting method is called preference scoring. For a
token in the sentence, the probabilities over all categories
of a polyphonic character sum to 1. Let us show a Chinese
example (E1) for the two voting methods; sentence (E1') is
its translation. As presented in Table 1, the polyphonic
character 卷 has three different pronunciations: 1. ㄐㄩㄢˋ,
2. ㄐㄩㄢˇ, and 3. ㄑㄩㄢˊ. Suppose that the occurrences of
the token 白卷 (blank examination paper) in these phonetic
categories are 26, 11, and 3, for a total of 40. The score of
each category under the two scoring methods can then be
calculated.

[Chinese sentence illegible in the source] (E1)
Government handed over a blank examination paper in
education and society. (E1')

Table 1: Example of the two scoring schemes of voting.

category       count   preference     w-t-all
1 ㄐㄩㄢˋ      26      26/40=0.65     40/40=1
2 ㄐㄩㄢˇ      11      11/40=0.275    0/40=0
3 ㄑㄩㄢˊ      3       3/40=0.075     0/40=0
Total ΣC(·)    40      1 score        1 score

ps. w-t-all denotes winner-take-all scoring.
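
The two scoring methods and the accumulation of Eq. (6) can be sketched as follows. The first example uses the counts 26, 11, and 3 from Table 1; the category labels, the second set of counts, and the function names are hypothetical and chosen only to show that the two methods can disagree. Tokens with no occurrences cast no vote in this sketch, and ties in winner-take-all go to the first category, both of which are assumptions.

```python
from typing import Callable, Dict, List, Tuple

Counts = Dict[str, int]      # category -> C(w_i, t) for one token
Scores = Dict[str, float]    # category -> score given by one token


def preference_scores(counts: Counts) -> Scores:
    """Eq. (4): each category receives its relative frequency; scores sum to 1."""
    total = sum(counts.values())
    if total == 0:
        return {t: 0.0 for t in counts}   # unseen token casts no vote
    return {t: c / total for t, c in counts.items()}


def winner_take_all_scores(counts: Counts) -> Scores:
    """Eq. (5): the most frequent category receives 1, all others 0."""
    if sum(counts.values()) == 0:
        return {t: 0.0 for t in counts}
    best = max(counts, key=counts.get)
    return {t: 1.0 if t == best else 0.0 for t in counts}


def vote(tokens: List[Counts],
         scorer: Callable[[Counts], Scores]) -> Tuple[str, Scores]:
    """Eq. (6): accumulate S_t(W) over all tokens and return the winner."""
    totals: Scores = {}
    for counts in tokens:
        for t, s in scorer(counts).items():
            totals[t] = totals.get(t, 0.0) + s
    return max(totals, key=totals.get), totals


# Counts from Table 1 for one token.
table1 = {"cat1": 26, "cat2": 11, "cat3": 3}
print(preference_scores(table1))
print(winner_take_all_scores(table1))

# Hypothetical three-token sentence on which the two methods disagree.
tokens = [{"a": 6, "b": 4}, {"a": 6, "b": 4}, {"a": 0, "b": 10}]
print(vote(tokens, winner_take_all_scores)[0], vote(tokens, preference_scores)[0])
```

In the second example winner-take-all elects category "a" with two tickets against one, while preference scoring elects "b" because the third token supports it unanimously. This shows how preference scoring keeps the strength of each voter's evidence that winner-take-all discards.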

D. Unknown Events - Zero Count Issue

In certain cases, the count $C(\cdot)$ of a novel (unknown)
word, which does not occur in the training corpus, may be
zero because the training data are limited while language is
infinite. It is always hard to collect sufficient data. The
potential issue of MLE is that the probability of unseen
events is exactly zero. This is the so-called zero-count
problem, and it degrades the performance of the system.

A zero count obviously leads to a zero probability
$P(\cdot)$ in Eqs. (2), (3), and (4). Many smoothing methods
have been proposed [7, 8, 9]. This paper adopts additive
discounting for calculating $P$ as follows:

(7) 

where $\delta$ denotes a small value ($\delta \le 0.5$), which is
added to the counts of all known and unknown events. This
smoothing method alleviates the zero-count issue in the
language model.
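
A common textbook form of additive smoothing adds $\delta$ to every count and $\delta$ times the number of possible events to the denominator. The sketch below uses that form as an assumption; it may differ in detail from the formula used in this paper, and the function name and the parameter `num_events` are hypothetical.

```python
def additive_prob(count: int, total: int, num_events: int, delta: float = 0.5) -> float:
    """Additive (add-delta) estimate: (C + delta) / (total + delta * num_events).

    count      -- C(event), possibly 0 for an unseen event
    total      -- sum of the counts of all events sharing the same context
    num_events -- number of distinct possible events (assumed known)
    delta      -- small positive constant, delta <= 0.5 in this paper
    """
    return (count + delta) / (total + delta * num_events)


# Counts of Table 1 (26, 11, 3) plus an unseen case with count 0.
counts = [26, 11, 3]
smoothed = [additive_prob(c, sum(counts), len(counts)) for c in counts]
print(smoothed, sum(smoothed))
print(additive_prob(0, 40, 3))  # small but non-zero
```

With this form the smoothed probabilities over all events still sum to 1, and an event with a zero count receives a small positive probability instead of exactly zero.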

E. Classifier - Predicting the Categories

Suppose that a polyphone has $T$ categories, $1 \le t \le T$.
How can we predict the correct target $\hat{t}$? As shown in
Eq. (8), the category with the maximum probability or score
is the most likely target:

$$\hat{t} = \arg\max_t P_t(W), \quad \text{or} \quad \hat{t} = \arg\max_t S_t(W) \tag{8}$$

where $P_t(W)$ is the probability of $W$ in category $t$, which
can be obtained from Eq. (1) for LMs, and $S_t(W)$ is the total
score based on the voting scheme from Eq. (6).

IV. Experiment Results 

In this paper, 10 Chinese polyphones are selected
randomly from the more than 1300 polyphones in Chinese. All
the possible pronunciations of the selected polyphones are
listed in Table 2; one polyphone, 著, has 5 categories, and 3
polyphones have 2 categories.

A. Dictionary and Corpus

The Academia Sinica Chinese Electronic Dictionary
(ASCED) contains more than 130K Chinese words,
composed of 2 to 11 characters. Each word in ASCED comes
with its part-of-speech (POS), frequency, and the
pronunciation of each character.

The experimental data are collected from the Academia
Sinica Balanced Corpus (ASBC) and web news of the China
Times. Sentences containing one of the 10 polyphones are
collected randomly. There are 9070 sentences in total,
which are divided into two parts: 8030 (88.5%) sentences
for training and 1040 (11.5%) sentences for outside testing.

B. Experiment Results 

Three LMs are generated: unigram, bigram, and trigram.
The precision rate (PR) is defined as:

$$PR = \frac{\text{number of correct predictions}}{\text{total number of sentences}} \tag{9}$$



Method 1: Dictionary Matching

The word matching phase processed 69 sentences, of which
7 were wrongly predicted. The resulting PR is 89.86%.
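
The precision rate of Eq. (9) and the figures quoted above can be checked with a few lines of Python. The function name is hypothetical; the numbers are the ones reported in this section.

```python
def precision_rate(num_correct: int, num_total: int) -> float:
    """Eq. (9): percentage of sentences whose polyphone is predicted correctly."""
    if num_total <= 0:
        raise ValueError("num_total must be positive")
    return 100.0 * num_correct / num_total


# Dictionary matching: 69 sentences processed, 7 wrongly predicted.
print(round(precision_rate(69 - 7, 69), 2))   # 89.86

# Training / outside-testing split of the 9070 collected sentences.
print(round(precision_rate(8030, 9070), 1))   # 88.5
print(round(precision_rate(1040, 9070), 1))   # 11.5
```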

Several examples are presented below to explain the
matching phase of dictionary matching.

[Chinese sentence illegible in the source] (E2)
We look back on the history of the Chinese. (E2')

Based on the matching algorithm, the two substrings $CH_L$
and $CH_R$ of the polyphone target ($w_p$ = 中) are extracted
from sentence (E2). After word segmentation, the matched
Chinese words are listed with their frequencies and
pronunciations; the frequencies are 83, 3542, and 487
[words and pronunciations illegible in the source].

According to the priority of word length first, 中國人
(Chinese people) decides the pronunciation of 中 as ㄓㄨㄥ.

[Chinese sentence illegible in the source] (E3)
Read the Chinese and then pronounce it in Cantonese. (E3')

The Chinese words matched in $CH_L$ and $CH_R$ have the
frequencies 83 and 343 [words and pronunciations illegible
in the source].

[Chinese sentence illegible in the source] (E4)
The path winds along mountain ridges, then watch the
reflection of China. (E4')

The Chinese words matched in $CH_L$ and $CH_R$ have the
frequencies 83 and 3542 [words and pronunciations illegible
in the source].

[Chinese sentence illegible in the source] (E5)
The future forecast of Academia Sinica of Chinese. (E5')

Chinese words in $CH_L$: NULL. The Chinese words matched in
$CH_R$ have the frequencies 2979 and 50 [words and
pronunciations illegible in the source].

In example (E5), only $CH_R$ contains segmented words;
there is no word in $CH_L$.

Method 2: Language Model (LMs) 

The experimental results of the three models, unigram,
bigram, and trigram, are listed in Table 3. The bigram LM
achieves 92.58%, the highest rate among the three models.

Method 3: Voting Scheme 

1) Winner-take-all: Three models, unitoken, bitoken, and
tritoken, are generated. As shown in Table 4, bitoken
achieves the highest PR of 90.17%.

2) Preference: Three models, unitoken, bitoken, and tritoken,
are generated. As shown in Table 5, bitoken preference
scoring achieves the highest PR of 92.72% on average.

C. Word 2007 precision rate 

MS Office is a well-known editing package around the
world. In our experiments, MS Word 2007 is used to
transcribe the same testing sentences. Its PR is 89.8% on
average, as shown in Table 6.

D. Results Analysis 

In this paper, a voting scheme with preference and
winner-take-all scoring and statistical language models
have been proposed and employed to resolve the ambiguity
of polyphones. We compare these methods with MS Word
2007. The preference bitoken scheme achieves the highest
PR among these models, 92.72%. With their best
configurations (bigram LM 92.58%, winner-take-all bitoken
90.17%, preference bitoken 92.72%), all the proposed
methods are superior to MS Word 2007 (89.8%).

In the following, two examples show a correct and a
wrong prediction by Word 2007.

[Zhuyin transcription and Chinese sentence illegible in the source]
Government handed over a blank examination paper in education
and society. (correct prediction)

[Zhuyin transcription and Chinese sentence illegible in the source]
Talking to oneself as if nobody is around. (wrong prediction)

We have constructed an intelligent e-learning system
[18] based on the unified approach proposed in this paper.
The system provides Chinese speech synthesis and displays
several useful kinds of lexical information, such as
transliteration, Zhuyin, and two pinyin systems, for learning
Chinese.

All the functions, such as the Chinese polyphone prediction
addressed in this paper and the transliteration and
transcription described above, are integrated in the
e-learning website to provide online searching and
translation through the Internet. If the predicted category is
wrong, the user may feed back the correct category of the
polyphone online, so that the system's prediction for Chinese
polyphones adapts gradually.

©2011 ACEEE
DOI: 01.IJSIP.02.01.211
ACEEE Int. J. on Signal & Image Processing, Vol. 02, No. 01, Jan 2011

V. Conclusion 

In this paper, we used several methods to address the
ambiguity of Chinese polyphones. First, three methods are
employed to predict the category of a polyphone: dictionary
matching, language models, and a voting scheme; the last
method has two different scoring schemes, winner-take-all
and preference scoring. Furthermore, we propose an
effective unified approach, which combines these methods
and adopts the better alternative, triggered by a threshold,
to improve the prediction.

Our approach outperforms MS Word 2007 and statistical
language models, and the best result of the final outside
testing achieves 92.72%. The proposed approach can be
applied to related issues in other languages.

Based on the proposed unified approach, we have
constructed an e-learning system in which several related
functions of Chinese text transliteration are integrated to
provide online searching and translation through the
Internet. In the future, several related issues should be
studied further:

1. Collecting a larger corpus and extending the proposed
methods to other Chinese polyphones.

2. Using more lexical features, such as location and
semantic information, to enhance the precision rate of
prediction.

3. Improving the smoothing techniques for unknown words.

4. Bilingual translation between English and Chinese.
Table 2: 10 Chinese polyphonic characters, their categories and meanings.

target   hanyu pinyin         English
中       zhong xin            center
         zhong du             poison
乘       cheng fa             multiplication
         da sheng
乾       gan jing             clean
         qian kun             the universe
了       wei le               in order to
         liao jie             understand
傍       pang bian            beside
         bang wan             nightfall
         yi shan bang shui    near the mountain and by the river
作       gong zuo             work
         zuo yi
         zuo xing
著       mang zhe             busy
         zhao ji              anxious
         zhao xiang           to bear in mind the interest of
         zhu                  famous
         zhuo                 inflexible
卷       kao juan             a test paper
         juan fa              curly hair
         quan qu              curl
咽       yan hou              the throat
         tun yan              swallow
         geng ye              to choke
從       cong shi             to devote oneself
         pu zong              servant
         cong rong            calm; unhurried
         zong heng            in length and breadth

ps: the Zhuyin Fuhau and Chinese word columns are illegible in the source.

Table 3: PR of outside testing on Language Model.

token     中      乘      乾      了      傍      作      著      卷      咽      從      avg.
unigram   95.88   86.84   92.31   70.21   85.71   96.23   75.32   100     98      91.67   89.98
bigram    96.75   84.21   96.15   85.11   92.86   94.34   81.17   96.30   100     93.52   92.58*
trigram   80.04   57.89   61.54   58.51   78.57   52.83   60.39   62.96   88      71.30   70.50

ps: * denotes the best PR among three n-gram models.

Table 4: PR of outside testing on Winner-take-all scoring.

token      中      乘      乾      了      傍      作      著      卷      咽      從      avg.
unitoken   96.96   84.21   80.77   57.45   71.43   94.34   58.44   85.19   84      87.04   84.69
bitoken    96.75   86.84   96.15   79.79   85.71   92.45   68.83   100     98      93.52   90.17*
tritoken   79.83   60.53   61.54   60.64   78.57   52.83   59.74   66.67   88      71.3    70.69

ps: * denotes the best PR among three n-gram models.
Table 5: PR of outside testing on Preference scoring.

token      中      乘      乾      了      傍      作      著      卷      咽      從      avg.
unitoken   96.96   84.21   80.77   70.21   71.43   94.34   70.13   85.19   88      87.96   87.76
bitoken    96.75   86.84   96.15   87.23   85.71   93.40   81.17   100     98      93.52   92.72*
tritoken   80.04   60.53   61.54   60.64   78.57   52.83   59.74   66.67   88      71.30   70.78

ps: * denotes the best PR among three n-gram models.






















Table 6: PR of Word 2007 on the same testing sentences.

token       中      乘      乾      了      傍      作      著      卷      咽      從      avg.
word 2007   93.37   76.47   76.67   83.65   78.57   93.70   78.33   82.76   100     91.51   89.80